22 research outputs found
Training IBM Watson Using Automatically Generated Question-Answer Pairs
IBM Watson is a cognitive computing system capable of question answering in natural languages. It is believed that IBM Watson can understand large corpora and answer relevant questions more effectively than any other question-answering system currently available. To unleash the full power of Watson, however, we need to train its instance with a large number of well-prepared question-answer pairs. Obviously, manually generating such pairs in a large quantity is prohibitively time consuming and significantly limits the efficiency of Watsonâs training. Recently, a large-scale dataset of over 30 million question-answer pairs was reported. Under the assumption that using such an automatically generated dataset could relieve the burden of manual question-answer generation, we tried to use this dataset to train an instance of Watson and checked the training efficiency and accuracy. According to our experiments, using this auto-generated dataset was effective for training Watson, complementing manually crafted question-answer pairs. To the best of the authorsâ knowledge, this work is the first attempt to use a large-scale dataset of automatically generated question-answer pairs for training IBM Watson. We anticipate that the insights and lessons obtained from our experiments will be useful for researchers who want to expedite Watson training leveraged by automatically generated question-answer pairs
Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation
Research on Korean grammatical error correction (GEC) is limited compared to
other major languages such as English and Chinese. We attribute this
problematic circumstance to the lack of a carefully designed evaluation
benchmark for Korean. Thus, in this work, we first collect three datasets from
different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide
range of error types and annotate them using our newly proposed tool called
Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is a
carefully designed edit alignment & classification tool that considers the
nature of Korean on generating an alignment between a source sentence and a
target sentence, and identifies error types on each aligned edit. We also
present baseline models fine-tuned over our datasets. We show that the model
trained with our datasets significantly outperforms the public statistical GEC
system (Hanspell) on a wider range of error types, demonstrating the diversity
and usefulness of the datasets.Comment: Add affiliation and email addres
Automated Crowdturfing Attacks and Defenses in Online Review Systems
Malicious crowdsourcing forums are gaining traction as sources of spreading
misinformation online, but are limited by the costs of hiring and managing
human workers. In this paper, we identify a new class of attacks that leverage
deep learning language models (Recurrent Neural Networks or RNNs) to automate
the generation of fake online reviews for products and services. Not only are
these attacks cheap and therefore more scalable, but they can control rate of
content output to eliminate the signature burstiness that makes crowdsourced
campaigns easy to detect.
Using Yelp reviews as an example platform, we show how a two phased review
generation and customization attack can produce reviews that are
indistinguishable by state-of-the-art statistical detectors. We conduct a
survey-based user study to show these reviews not only evade human detection,
but also score high on "usefulness" metrics by users. Finally, we develop novel
automated defenses against these attacks, by leveraging the lossy
transformation introduced by the RNN training and generation cycle. We consider
countermeasures against our mechanisms, show that they produce unattractive
cost-benefit tradeoffs for attackers, and that they can be further curtailed by
simple constraints imposed by online service providers